Sequencing (3rd Wave)


There are three generations of sequencing: the original, first-generation Capillary Electrophoresis (or CE), with its better-known form of Sanger Sequencing; the second or Next Generation Sequencing (NGS for short), which is dominant and the focus here; and the still-developing third-generation Long Read Sequencing. These three generations of sequencing are not to be confused with the three waves of test methods used in genetic genealogy.

The majority of the focus here is on the dominant Next Generation Sequencing used for the third wave of genetic genealogy. The first-generation CE has its own page but is also described a little more below. The emerging 3rd generation is described at the end of this page. The Wikipedia articles referenced below are simple, self-consistent explanations with diagrams and a good source for exploring the sequencing field further. Other references cover specific topics in more detail.

Sanger Sequencing

Before NGS came about, Sanger Sequencing was the original (1st generation) form of sequencing. It has been around since the 1980s and was used in the original Human Genome Project (HGP) to create the first human genome reference model in the 1990s. Sanger Sequencing is actually a specific form of the more general Capillary Electrophoresis technique.

Sanger Sequencing and 2nd wave microarray testing both require primers to find the area of interest to be tested, so they are really more useful for verifying known values than for investigating new markers in DNA. Sanger Sequencing is one of the main first-generation sequencing techniques (the first of the chain-termination techniques). Frame (length) Analysis, another Capillary Electrophoresis technique, is still the mainstay of yDNA STR testing at FTDNA and YSEQ.

Next Generation Sequencing

Next Generation Sequencing (or NGS for short) is an advanced form of genetic testing for base-pair values that extracts all variants, both known and unknown, from the DNA. It is therefore the most useful for finding new markers that finely distinguish between groups, like no approach before it. All previous generations and waves of testing provided values only for known variants or areas of the genome.

NGS is also known as High Throughput Sequencing (HTS) and Massively Parallel Sequencing (MPS) due to the millions of values returned in each cycle during testing. This is as opposed to just a few hundred thousand total, in one cycle, with microarray testing, and a handful to hundreds with the original CE form of testing. NGS technology introduced the ability to do automated, all-at-once Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES), and that is its main use both in the medical market and in genetic genealogy.

We consider the use of NGS to be the 3rd wave of genetic genealogy testing technology, revolutionizing what is available much as the 2nd wave (microarray testing) did before it and the pioneering 1st wave (CE technology) did at the start. It was the dramatic reduction in cost, from hundreds of thousands, to tens of thousands, to under a thousand dollars, that enabled the use of NGS in genetic genealogy. With the introduction of sub-$500 30x WGS testing in 2018, this is the new, affordable alternative and go-to for many in the industry. (Note: the FTDNA BigY-700 test, introduced in 2018, uses the same equipment as a 30x WGS test, but in more of a WES mode to greatly reduce the cost.)

Unlike the "Microarray Testing" approach used for most autosomal and matching segment analysis, this form is a quasi-direct, full-sequencing approach. NGS performs reads of actual base-pair sequences for ALL the DNA in the cell. The full 3+ billion base-pairs of the haploid or really double that as it returns the full ploidy. Key is that NGS is usually PCR free (unless done in WES mode) and thus returns more reliable results free of PCR duplication errors and stutter that may be introduced into STRs.

The main implementation of NGS used today is what is termed massively-parallel, shotgun, paired-end, short-read sequencing. Like microarray testing, it can return many values from many different testers in each cycle, which is what has lowered its cost and enabled its use in Direct-to-Consumer (DTC) testing. Both microarray and NGS use more of a 2D-array approach, as opposed to the linear, capillary form of the original 1st wave CE. (Although panels of multiple capillaries are used in modern CE testing.)

[Image: WGS Bioinformatics Pipeline]
The current NGS approach breaks the DNA strand up into millions of short segments and then performs direct reads of the many copies of each short segment in parallel. Each segment is roughly 200 to 500 base-pairs in length, on average, depending on the technique and approach used by the lab. (The goal is to have as long a read segment as possible / feasible.) Unlike the Microarray Testing approach, NGS generally does not use PCR duplication, or only sparingly if at all, as it inherently introduces error; it relies instead on the many original copies of DNA in the sample. Labs often try to achieve 30 or more overlapping segment reads per base-pair so any read errors are easily determined. What comes out of the sequencer are paired, 150 base-pair long segments of DNA; paired because it reads from each end of the strand. This processed sequencer output is held in a FASTQ file and delivered as the result. See the description of Sequencing File Formats for more information.
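
To make the FASTQ format concrete, here is a minimal Python sketch (the file name is hypothetical) of reading the four-line records a sequencer delivers: each record holds a read ID, the called bases, a separator, and one Phred-encoded quality character per base.

    # Minimal FASTQ reader: each record is 4 lines (ID, called bases,
    # "+" separator, and a quality string the same length as the bases).
    def read_fastq(path):
        with open(path) as fh:
            while True:
                header = fh.readline().strip()
                if not header:
                    break  # end of file
                seq = fh.readline().strip()   # the called bases (A/C/G/T/N)
                fh.readline()                 # "+" separator line, ignored
                qual = fh.readline().strip()  # one quality character per base
                yield header, seq, qual

    def phred(q_char):
        # Phred+33 encoding: Q30 means a 1-in-1000 chance the base is wrong.
        return ord(q_char) - 33

    for name, seq, qual in read_fastq("sample.fastq"):  # hypothetical file
        avg_q = sum(phred(c) for c in qual) / len(qual)
        print(name, len(seq), round(avg_q, 1))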

There is a way to process a paired-end test to deliver the matched pairs as longer, single-end reads. While 150 base-pair paired-end is the gold standard, 400 base-pair single-end tests have been hitting the market using the same equipment that provides 200 base-pair paired-end reads.
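
As a rough illustration of how a read pair can be combined into one longer single-end read, here is a toy Python sketch that merges a read with the reverse complement of its mate by finding their overlap. Real pipelines use dedicated, quality-aware tools; this only shows the core idea, and the sequences are made-up short examples.

    # Toy merge of an overlapping read pair into one longer single-end read.
    # The second read comes off the opposite strand, so it is
    # reverse-complemented before looking for the overlap.
    COMP = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        return seq.translate(COMP)[::-1]

    def merge_pair(read1, read2, min_overlap=10):
        r2 = revcomp(read2)
        # Try the longest possible exact overlap first, then shrink.
        for ov in range(min(len(read1), len(r2)), min_overlap - 1, -1):
            if read1[-ov:] == r2[:ov]:
                return read1 + r2[ov:]
        return None  # no confident overlap; keep the reads as a pair

    # Two made-up reads of one short fragment, overlapping by 11 bases:
    print(merge_pair("ACGTACGTAAGGCCTTAGC", "TTTTGCTAAGGCCTT"))
    # -> ACGTACGTAAGGCCTTAGCAAAA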

With sequencing, unlike microarray testing, there are many additional information-processing steps to take to determine the SNP genetic marker values or end result. Microarray testing processes a visual image of the test chip and outputs the determined SNP values directly, knowing what is being looked for at each location. With sequencing, many more steps are needed, as we do not know what is being viewed at each location. NGS involves a similar visual image, one for each base-pair read as it walks the DNA segment, but much more processing is needed to determine where that read base-pair belongs in the human genome reference model. Key is that the steps taken during and after sequencing, and the specific data libraries used, can vary depending on what questions you are looking to answer. Microarray testing looks at specific, known points in the DNA. Sequencing, as described here, simply reads the values presented to it without knowing where in the DNA they lie.

The processing of the sequence data uses what is termed a bioinformatics pipeline, shown here with the key, major steps of the intermediate data and tools. The subsection of Bioinformatics diagrammed here is termed sequence analysis.

The first major step in the bioinformatics pipeline is mapping the read segments onto the human genome model (called aligning). The next major step is determining the value of each SNP read (known as pileup and variant calling). A full read of each complete chromosome is not always possible with this technique, but near-complete coverage of the current human genome model is usually obtained; the breadth of coverage of the whole genome usually exceeds 99.95%.
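
Here is a toy Python sketch of the pileup and variant-calling idea from the paragraph above: for one reference position, tally the bases from all reads covering it and compare the consensus against the reference. The thresholds are illustrative only; real callers also weigh base and mapping qualities.

    from collections import Counter

    # For one reference position: tally the bases of all reads piled up
    # on it, then compare the consensus against the reference base.
    def call_position(ref_base, piled_bases, min_depth=10, min_fraction=0.8):
        depth = len(piled_bases)
        if depth < min_depth:
            return "no-call"  # too few reads to trust
        base, count = Counter(piled_bases).most_common(1)[0]
        if count / depth < min_fraction:
            return "ambiguous"  # could be heterozygous, or noise
        return "reference" if base == ref_base else f"variant {ref_base}->{base}"

    # 30 reads cover this position; 28 read G where the reference has A:
    print(call_position("A", "G" * 28 + "A" * 2))  # -> variant A->G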

A key factor in this process is having a human genome reference model on which to place the read short-segments, using the model as a scaffold. It is nowhere near as easy as a jigsaw puzzle: although our DNA is roughly 99.9% the same between all humans, there can be major areas of similarity that are thousands to tens of thousands of base-pairs long. Once mapped, the variants from the reference are determined. These variants are then compared against databases of studied variants. This final step is necessary for health and medical analysis and is used for haplogroup formation in genetic genealogy. As before, in genetic genealogy we are looking to compare tested individuals against each other. But unlike before, this test technique can be used to find new, never-before-discovered variants between individual testers.

Key parameters from the NGS laboratory are the segment read length (100, 150, 200, etc. base-pairs) and the overall mapped (average) read depth (30x, 50x, etc.; basically, how many times a given base-pair appears in the result file, on average). A general rule of thumb is that the human genome needs 600 million 150 base-pair reads, or 90 gigabases of read data, to achieve a RAW read-depth coverage of 30x, and then often a 95% or better mapping quality to carry through to analyzable results. Another key parameter is the insert size (or fragment length) and how close that is to the read segment length (or 2x that, if paired-end).
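
The rule of thumb above is simple arithmetic, sketched here in Python to show where the 30x figure comes from; the 95% factor is the illustrative mapping carry-through mentioned above.

    genome_size = 3.1e9   # haploid human genome, base-pairs (approximate)
    read_length = 150     # base-pairs per read
    reads = 600e6         # reads delivered by the lab

    raw_bases = reads * read_length      # 9.0e10, i.e. 90 gigabases
    raw_depth = raw_bases / genome_size  # ~29x RAW read depth
    mapped_depth = raw_depth * 0.95      # ~27.6x if 95% of reads map well
    print(round(raw_depth, 1), round(mapped_depth, 1))  # 29.0 27.6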

In general, while NGS offers a rapid whole-genome sequencing method, it suffers a limitation due to the short length of the segments it reads. Many STR markers, and other naturally occurring repeat or similar variant-value regions, require a longer read length to reliably capture their value. These are still only reliably read using the original Sanger Sequencing method, which tends to work well with read lengths upwards of 500 to 800 base-pairs. But even Sanger Sequencing has issues if too much variation is possible in a given area of the genome; it relies on having reliable nearby target molecules (the primer) to identify the area to sequence. Also problematic for the current NGS technique are sequences of 300 base-pairs or less that are not unique across the whole haploid genome and thus cannot be reliably mapped to a specific chromosome and region. Luckily, these regions tend not to have SNPs that can be used to compare different testers for matching.
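
To illustrate why short reads struggle with STRs, here is a toy Python sketch that only reports a repeat count when a single read spans the entire repeat plus some flanking sequence on both sides. The GATA motif (a common STR motif) and the reads are made-up examples.

    import re

    # An STR count is only knowable when one read spans the whole repeat
    # plus some unique flank on both sides; otherwise the repeat may
    # extend beyond what this read can see.
    def count_str(read, motif="GATA", flank=5):
        m = re.search(f"(?:{motif})+", read)
        if not m:
            return None
        if m.start() < flank or len(read) - m.end() < flank:
            return None  # repeat runs to the edge of the read
        return (m.end() - m.start()) // len(motif)

    print(count_str("CTTGAGATAGATAGATAGATATTCGA"))  # 4: fully spanned
    print(count_str("GATAGATAGATAGATAGATAGATAGA"))  # None: truncated repeat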

Whole Genome and Exome Sequencing

Whole Genome Sequencing (WGS) is an all-inclusive NGS test covering all the human DNA. A 30x average read depth is the clinical standard for WGS testing, but as low as 15x can often be good enough if extra care is taken in the laboratory and the mapping rate of the result to the human genome is high. (That is, minimal contamination of the sample DNA occurs. NGS is more susceptible to contamination because it sequences all the DNA it finds, not knowing where it originates, and does not use primers to find specific, known DNA segments.)

Aligned read-segment BAM files can be post-processed to extract many different markers: both SNPs and STRs in most cases, as well as more specialized InDels and other variants. STR markers have to be short and consistent enough to be covered by the short segments used in this test technique. Longer repeats or structural variations that overlap the short-segment boundaries cannot be reliably determined, and typical 1st generation Frame Analysis via Capillary Electrophoresis must be used instead.

A typical WGS test will return well over 3 million variants for a given tester, compared against a database of over 30 million known variants in the human genome reference. This compares to around 60,000 variants returned by microarray tests (about 10% of the returned values will turn out to vary from the reference model). NGS testing of the yDNA has led to the discovery of hundreds of thousands of new SNP markers.

Fall 2018 saw the first price drop of a WGS test to below $500, and even below $200 in flash sales over the years. This direct-to-consumer (DTC) sequencing market is so new that testers have to do a lot of the analysis work themselves to benefit from an NGS test in a genetic genealogy context. After that, the "callable" coverage of the base-pairs sequenced across all the DNA (chromosomes and mtDNA) is available for use in autosomal match analysis or for placing on a phylogenetic tree (that is, yDNA and mtDNA analysis). New tools or sites that take in Sequencing File Formats are required to achieve this analysis more automatically for consumers. But today, especially with the cheaper available tests, the consumer has to process the files themselves.

The leading early players in the consumer market applying NGS are FamilyTreeDNA for a restricted Y only, and Full Genomes and YSEQ for whole genome (aka WGS) as well as targeted yDNA only. There are now the new entrants Dante Labs and Nebula Genomics for whole genome sequencing, with Sequencing.com also gaining a foothold. WGS results include the autosomes, allosomes and mtDNA, like traditional 2nd wave genetic genealogy microarray testing. Post-processing (and match) sites for NGS testing are available at Sequencing.com, yFull, FGC and YSEQ (and of course FTDNA for their own, in-house BigY test), with more coming online. yDNA-Warehouse is expected sometime in 2022.

For comparison, traditional Microarray Testing produces around 600,000 to 800,000 SNPs from the haploid genome (double that, as two values per SNP are actually returned for the diploid autosomes). NGS will generally return well over 3 million true, variant SNPs in the autosomes alone, but many more billions of actual base-pair values read. (Microarray testing returns more like a gVCF file with the value for each SNP read, whether a derived or ancestral variant. NGS testing returns a reliable read value for 99.97+% of all base-pairs that exist in your DNA, and so 6 billion plus values. Hence why some advertise that their NGS test returns 10,000 times more values than microarray tests. It is not an apples-to-apples comparison, though, as microarray testing returns 2x as many values because they are diploid. So it is more like 650,000 diploid values for microarray testing versus 3.1 billion for NGS, or about 5,000 times more.)

Microarray testing returns custom file formats of SNP results that are termed RAW File Format. NGS test results tend to be delivered in standard VCF, BAM and maybe even FASTx format files of 1 to 150 GB in size. It is important to understand the Genome Build of any mapped / aligned file when delivered, no matter the test format. This is important for the RAW File Formats too, but most use a common Build 37.
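
As one example of checking the Genome Build of a delivered file, here is a Python sketch that reads the header of a VCF and infers the build from the stated chromosome 1 length (249,250,621 in Build 37 / GRCh37 versus 248,956,422 in Build 38 / GRCh38); the file name is hypothetical.

    import gzip

    # Chromosome 1 length differs between the two common builds.
    BUILD_BY_CHR1_LEN = {"length=249250621": "Build 37",
                         "length=248956422": "Build 38"}

    def sniff_build(path):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fh:
            for line in fh:
                if not line.startswith("##"):
                    break  # past the meta-header; stop looking
                if line.startswith("##contig=") and ("ID=1," in line or "ID=chr1," in line):
                    for token, build in BUILD_BY_CHR1_LEN.items():
                        if token in line:
                            return build
        return "unknown"

    print(sniff_build("sample.vcf.gz"))  # hypothetical delivered file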

Whole Exome Sequencing (WES) often applies Whole Genome Sequencing (WGS) techniques (and the same equipment), simply pre-processing the sample to enrich for the genes in the Exome before the actual sequencing run. The idea is that, for medical reasons, you may only care about the changes actually contained within a known gene, and thus can get more reads from the important areas with less effort. This technique is not really useful or applicable to genetic genealogy: while SNPs in the Exome are important, they are no more or less important than SNPs elsewhere in the DNA. FTDNA BigY and FGC yElite are yDNA NGS tests, but they use a WES-style technique of enrichment to reduce the cost and increase the yield of the sequencing equipment.

Long-read Sequencing

There are other forms of sequencing being developed now as well. Tagged or linked-read sequencing (often termed generation 2.5, and also known as bar-coded) is really an extended form of NGS. A true 3rd generation of sequencing that is still in its development stages is termed long-read sequencing. This overcomes the limitations in highly repetitive areas of the genome that the short reads of NGS cannot resolve, leading to a true de novo assembly of a person's whole genome. Around 5% of the human genome is so repetitive that short-read, shotgun sequencing cannot discern it. Over 50% of the yDNA chromosome falls into that category. Long-read sequencing still processes all the DNA, as done with the WGS and WES tests already explained above.

Long Read sequencing is an emerging technique to extend NGS to longer read lengths, often based on radically new technology. Full Genomes and Dante Labs were offering early long-read products to the genetic genealogy community based on the Oxford Nanopore technique, but these have not proven useful due to shorter-than-desired read lengths or highly inaccurate read values. Long read has historically been fraught with too many read errors to make it accurate enough for medical and even genealogical purposes. But equipment, chemistry and analysis are improving quickly.

True full sequencing with longer reads is still prohibitively expensive for genetic genealogy purposes and only possible with hybrid and customized Sanger Sequencing approaches. UPDATE: As of December 2022 and the commercial introduction of the PacBio HiFi CCS sequencer, long-read sequencing is now being offered by Dante Labs; initially for under $1,000, but with coverage too low to be very useful. It is now more like $10,000 before the coverage is likely useful here.

The nascent technology of long-read sequencing (what some term the third generation of sequencing) is being made more reliable with further research. (That is, some in the genetic genealogy community are kicking the tires as the kinks are worked out to make it usable.) The promise, if it pans out, is that sequences of tens of thousands to millions of base-pairs will be reliably read at once, not just the 100 to 300 of current NGS techniques; thus breaking through large base-pair disturbances of the genome, reading long-sequence STR markers, and detecting when large multi-count STR markers are disturbed by a single SNP change in the middle. No common acronym for this third generation is in use yet.

There is some work on what is being termed Linked Read or "barcoded" Sequencing as a specific technique. This is not to be confused with the Long Read Sequencing already mentioned above. It basically breaks the DNA into longer segments (two to ten or so kilobases in length this time), then separately breaks those up into current short-read segments, but puts a unique tag on each of the longer sequences before they are broken up further. Unique tagging like this is already done with short-read segments to allow a mix of different samples in the same lanes per run of a sequencer; the tagged results are then sorted out before further processing so more samples can be sequenced at the same time. For "barcoded" sequencing, instead of uniquely tagging separate samples, they uniquely tag long strands within a given sample and then use traditional NGS short-read, paired-end, massively-parallel technology to read the short segments created from the longer, "barcoded" ones. They then use the "barcodes" to re-assemble the longer segments of DNA, as sketched below. The result is much more reliable reads of longer sequences. Maybe we can call this "LRS" 2.5 generation sequencing?
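
Here is a toy Python sketch of that re-assembly idea: group the short reads by the barcode of the long molecule they came from, so kilobase-scale segments can be reconstructed from ordinary short reads. The barcodes and sequences are made-up examples.

    from collections import defaultdict

    # Each short read carries the barcode of the long molecule it was
    # cut from; grouping by barcode recovers which reads belong together.
    def group_by_barcode(tagged_reads):
        molecules = defaultdict(list)
        for barcode, seq in tagged_reads:
            molecules[barcode].append(seq)
        return molecules

    reads = [  # made-up (barcode, short-read) pairs
        ("BX:AAT", "ACGTACGT"), ("BX:GGC", "TTAGGCAT"),
        ("BX:AAT", "CGTTAGCA"), ("BX:AAT", "GGATCCAA"),
    ]
    for barcode, seqs in group_by_barcode(reads).items():
        print(barcode, len(seqs), "short reads from one long molecule")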

Key Terms

Listed here are key terms to understand about sequencing that are not yet explained elsewhere.

read length
The length of a fragment of DNA that can be read by the sequencer before the reading just becomes too unreliable. For NGS (short-read) situations.

insert size
Also termed the fragment length. The rough median size in base-pairs of a fragment of DNA inserted into the sequencer.

fragment
A portion of the DNA that is supplied to a sequencer. Often an attempt is made to get the fragments of the original DNA strands as close to a uniform size as possible.

short-read
A term mostly applied to NGS technology, which can only read a small number of base-pairs before errors become too great; hence why it is often done in paired-end mode as well. At one time only 50 base-pairs; more standardly 150 base-pairs today. Reads up to 250 base-pairs have been accomplished in more of a research setting.

long-read
A term defining the third generation of sequencing, where read lengths far exceed the previous limits of Sanger Sequencing. Read lengths in the tens of thousands of base-pairs with NGS-style reliability are available now. ONT, using nanopore technology, regularly achieves read lengths of a hundred thousand base-pairs or more. Research labs have been able to exceed one million base-pair read lengths on nanopore machines.

paired-end
A technique that is part of short-read NGS where a fragment is read first from one end and then from the other; similar to two people sucking on either end of a strand of spaghetti. The sequencer can only read as far as the read length into a fragment, so the goal is for the insert size to be as close as possible to double the read length so that the whole fragment is read.

single-end
Normally what is done in Sanger Sequencing. Also a technique for post-processing NGS paired-end results with the correct insert size to create single-end results.

read depth
The measure of how many times a particular base-pair has been read in an NGS-style sequencing. Often estimated by dividing the number of base-pair values returned from a sequencing run by the whole-genome base-pair count, and thus reported as the average read depth across the whole genome.

read coverage
Most often meant to be the breadth of coverage of the genome in a WGS result; that is, what percentage of the genome's base-pairs have been read at least once, or some set minimum number of times. Some mean the depth of coverage when using this term, but really read depth should be used there.
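
To tie the last two terms together, here is a toy Python sketch computing both from a per-base coverage array over a made-up 10-position "genome": depth is the average reads per position, breadth is the fraction of positions read at least a minimum number of times.

    # depth: average reads covering each position;
    # breadth: fraction of positions covered at least min_reads times.
    def depth_and_breadth(coverage, min_reads=1):
        depth = sum(coverage) / len(coverage)
        breadth = sum(1 for c in coverage if c >= min_reads) / len(coverage)
        return depth, breadth

    coverage = [31, 28, 35, 0, 29, 33, 30, 27, 32, 30]  # reads per position
    d, b = depth_and_breadth(coverage)
    print(f"{d:.1f}x average read depth, {b:.0%} breadth of coverage")
    # -> 27.5x average read depth, 90% breadth of coverage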